Spatio-temporal video autoencoder with differentiable memory
We describe a new spatio-temporal video autoencoder, based on a classic
spatial image autoencoder and a novel nested temporal autoencoder. The temporal
encoder is represented by a differentiable visual memory composed of
convolutional long short-term memory (LSTM) cells that integrate changes over
time. Here we target motion changes and use as temporal decoder a robust
optical flow prediction module together with an image sampler serving as
built-in feedback loop. The architecture is end-to-end differentiable. At each
time step, the system receives as input a video frame, predicts the optical
flow based on the current observation and the LSTM memory state as a dense
transformation map, and applies it to the current frame to generate the next
frame. By minimising the reconstruction error between the predicted next frame
and the corresponding ground truth next frame, we train the whole system to
extract features useful for motion estimation without any supervision effort.
We present one direct application of the proposed framework in
weakly-supervised semantic segmentation of videos through label propagation
using optical flow.
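The per-step mechanism described above (predict a dense transformation map, apply it to the current frame, penalise the reconstruction error against the true next frame) can be illustrated with a minimal numpy sketch. This is not the paper's network; the flow is given by hand rather than predicted by the convolutional LSTM, and the sampler uses nearest-neighbour lookup for brevity:

```python
import numpy as np

def warp_frame(frame, flow):
    """Warp a frame by a dense flow field using nearest-neighbour sampling.

    frame: (H, W) array; flow: (H, W, 2) array of (dy, dx) sampling offsets,
    i.e. output pixel (y, x) is read from frame[y + dy, x + dx].
    """
    H, W = frame.shape
    ys, xs = np.mgrid[0:H, 0:W]
    src_y = np.clip(np.round(ys + flow[..., 0]).astype(int), 0, H - 1)
    src_x = np.clip(np.round(xs + flow[..., 1]).astype(int), 0, W - 1)
    return frame[src_y, src_x]

def reconstruction_error(pred, target):
    """Mean squared error between predicted and ground-truth next frame."""
    return float(np.mean((pred - target) ** 2))

# A vertical bar that shifts one pixel to the right each time step.
frame_t  = np.zeros((4, 4)); frame_t[:, 1] = 1.0
frame_t1 = np.zeros((4, 4)); frame_t1[:, 2] = 1.0

# A flow that samples each output pixel from one column to the left
# reproduces the rightward shift exactly, so the error vanishes.
flow = np.zeros((4, 4, 2)); flow[..., 1] = -1.0
pred = warp_frame(frame_t, flow)
print(reconstruction_error(pred, frame_t1))  # 0.0
```

In the actual architecture the flow field would be the output of the temporal decoder, the warp would be a differentiable bilinear sampler, and the reconstruction error would be backpropagated through both to train the convolutional LSTM memory.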
Joint A Contrario Ellipse and Line Detection.
This is the author accepted manuscript. The final version is available from IEEE via http://dx.doi.org/10.1109/TPAMI.2016.2558150

We propose a line segment and elliptical arc detector that produces a reduced number of false detections on various types of images without any parameter tuning. For a given region of pixels in a grey-scale image, the detector decides whether a line segment or an elliptical arc is present (model validation). If both interpretations are possible for the same region, the detector chooses the one that best explains the data (model selection). We describe a statistical criterion based on the a contrario theory, which serves for both validation and model selection. The experimental results highlight the performance of the proposed approach compared to state-of-the-art detectors when applied to synthetic and real images.

This work was partially funded by the Qualcomm postdoctoral program at École Polytechnique Palaiseau, a Google Faculty Research Award, the Marie Curie grant CIG-334283-HRGP, a CNRS chaire d'excellence and chaire Jean Marjoulet, and EPSRC grant EP/L010917/1.
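The a contrario validation criterion mentioned above is conventionally expressed as a Number of False Alarms: NFA = (number of tests) × P(an observation at least as structured arises under a background noise model), with a candidate accepted when NFA < 1. A minimal stdlib sketch, using the standard binomial-tail form for a region of n pixels of which k have gradient orientation aligned with the candidate (the counts and the alignment probability p below are illustrative, not taken from the paper):

```python
from math import comb

def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def nfa(n_tests, n, k, p):
    """Number of False Alarms: expected number of candidates at least this
    structured that the background noise model alone would produce."""
    return n_tests * binomial_tail(n, k, p)

# A candidate segment region of 20 pixels, 18 of them with gradient
# orientation aligned with the segment direction; alignment occurs by
# chance with probability p = 1/8 (orientation quantised to 8 bins).
score = nfa(n_tests=10**6, n=20, k=18, p=1/8)
print(score < 1)  # True: the candidate is validated (NFA below 1)
```

Model selection between the line-segment and elliptical-arc interpretations of the same region then amounts to keeping the interpretation with the smaller NFA.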
SceneNet: Understanding Real World Indoor Scenes With Synthetic Data
Scene understanding is a prerequisite to many high level tasks for any
automated intelligent machine operating in real world environments. Recent
attempts with supervised learning have shown promise in this direction but also
highlighted the need for enormous quantities of supervised data: performance
increases in proportion to the amount of data used. However, this quickly
becomes prohibitive when considering the manual labour needed to collect such
data. In this work, we focus our attention on depth based semantic per-pixel
labelling as a scene understanding problem and show the potential of computer
graphics to generate virtually unlimited labelled data from synthetic 3D
scenes. By carefully synthesizing training data with appropriate noise models
we show comparable performance to state-of-the-art RGBD systems on NYUv2
dataset despite using only depth data as input and set a benchmark on
depth-based segmentation on SUN RGB-D dataset. Additionally, we offer a route
to generating synthesized frame or video data, and understanding of different
factors influencing performance gains.
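The "appropriate noise models" step (perturbing clean synthetic depth so that it resembles real sensor output) can be sketched as follows. The quadratic depth-dependent axial noise is a common approximation for structured-light depth sensors; the coefficients below are illustrative placeholders, not the calibrated values used in the paper:

```python
import numpy as np

def add_depth_noise(depth, sigma_base=0.0012, sigma_quad=0.0019, rng=None):
    """Perturb a clean synthetic depth map (in metres) with axial noise
    whose standard deviation grows quadratically with distance, a common
    approximation of structured-light sensor behaviour. The coefficients
    are illustrative, not calibrated sensor parameters."""
    rng = rng or np.random.default_rng(0)
    sigma = sigma_base + sigma_quad * (depth - 0.4) ** 2
    return depth + rng.normal(0.0, 1.0, depth.shape) * sigma

clean = np.full((480, 640), 2.0)   # a flat wall 2 m from the camera
noisy = add_depth_noise(clean)
print(round(float(np.std(noisy - clean)), 4))
```

Training the segmentation network on such perturbed renders, rather than on the clean ones, is what the abstract credits for closing most of the gap to models trained on real RGB-D captures.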
SynthCam3D: Semantic Understanding With Synthetic Indoor Scenes
We are interested in automatic scene understanding from geometric cues. To
this end, we aim to bring semantic segmentation in the loop of real-time
reconstruction. Our semantic segmentation is built on a deep autoencoder stack
trained exclusively on synthetic depth data generated from our novel 3D scene
library, SynthCam3D. Importantly, our network is able to segment real world
scenes without any noise modelling. We present encouraging preliminary results.
Detection and Identification of Elliptical Structures in Images: Paradigm and Algorithms
This thesis deals with different aspects of the detection, fitting, and identification of elliptical features in digital images. We place geometric feature detection in the a contrario statistical framework in order to obtain a combined, parameter-free line segment and circular/elliptical arc detector that controls the number of false detections. To improve the accuracy of the detected features, especially in cases of occluded circles/ellipses, a simple closed-form technique for conic fitting is introduced, which efficiently merges the algebraic distance with the gradient orientation. Identifying a configuration of coplanar circles in images through a discriminant signature usually requires the Euclidean reconstruction of the plane containing the circles. We propose an efficient signature computation method that bypasses the Euclidean reconstruction; it relies exclusively on invariant properties of the projective plane, being thus itself invariant under perspective.
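The algebraic-distance conic fitting that the thesis builds on can be sketched in a few lines: stack one row [x², xy, y², x, y, 1] per point and take the right singular vector of the smallest singular value. This is the plain algebraic fit only; the thesis's contribution of folding in the gradient orientation is omitted here:

```python
import numpy as np

def fit_conic(x, y):
    """Least-squares conic fit minimising the algebraic distance:
    find theta minimising ||D theta|| subject to ||theta|| = 1, where
    each row of D is [x^2, xy, y^2, x, y, 1]. (Plain algebraic fit;
    the gradient-orientation term from the thesis is not included.)"""
    D = np.column_stack([x**2, x*y, y**2, x, y, np.ones_like(x)])
    _, _, Vt = np.linalg.svd(D)
    return Vt[-1]  # right singular vector of the smallest singular value

# Points on the ellipse x^2/4 + y^2 = 1.
t = np.linspace(0, 2*np.pi, 50, endpoint=False)
x, y = 2*np.cos(t), np.sin(t)
A, B, C, Dc, E, F = fit_conic(x, y)
# The fit recovers the conic up to scale: x^2 + 4y^2 - 4 = 0,
# so B = Dc = E = 0 and C/A = 4, F/A = -4.
print(round(C / A, 3), round(F / A, 3))  # 4.0 -4.0
```

On clean data the null space of D is exactly the true conic; on occluded arcs, where few points constrain the fit, the extra gradient-orientation constraint described in the thesis is what stabilises the estimate.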
Scene Structure Inference through Scene Map Estimation
Understanding indoor scene structure from a single RGB image is useful for a wide variety of applications, ranging from the editing of scenes to the mining of statistics about space utilization. Most efforts in scene understanding focus on the extraction of either dense information, such as pixel-level depth or semantic labels, or very sparse information, such as bounding boxes obtained through object detection. In this paper we propose the concept of a scene map, a coarse scene representation which describes the locations of the objects present in the scene from a top-down view (i.e., as they are positioned on the floor), as well as a pipeline to extract such a map from a single RGB image. To this end, we use a synthetic rendering pipeline which supplies an adapted CNN with virtually unlimited training data. We quantitatively evaluate our results, showing that we clearly outperform a dense baseline approach, and argue that scene maps provide a useful representation for abstract indoor scene understanding.
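The scene-map representation itself (object locations rasterised onto a coarse top-down floor grid) can be sketched as follows. The class ids, room extent, and grid resolution below are illustrative choices, not the paper's parameters:

```python
import numpy as np

def scene_map(objects, extent=5.0, cells=10):
    """Rasterise object floor positions into a coarse top-down grid.

    objects: list of (class_id, x, z) with x, z floor coordinates in
    metres inside [0, extent). Returns a (cells, cells) int grid with
    0 = empty. Extent and resolution are illustrative choices.
    """
    grid = np.zeros((cells, cells), dtype=int)
    scale = cells / extent
    for cls, x, z in objects:
        grid[int(z * scale), int(x * scale)] = cls
    return grid

# A bed (class 1) near one corner and a chair (class 2) mid-room.
m = scene_map([(1, 0.6, 0.8), (2, 2.7, 2.4)])
print(m[1, 1], m[4, 5])  # 1 2
```

Such a grid sits between the dense (per-pixel) and sparse (bounding-box) outputs contrasted in the abstract: it discards appearance and height but keeps the floor-plan layout that abstract reasoning about space utilization needs.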